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A Randomized Rounding Algorithm for Sparse PCA 

Kimon Fountoulakis* Abhisek Kundu^^ Eugenia-Maria Kontopoulou^ Petros Drineas 


Abstract 

We present and analyze a simple, two-step algorithm to approximate the optimal solution of the 
sparse PCA problem. Our approach first solves an £i-penalized version of the NP-hard sparse PCA opti¬ 
mization problem and then uses a randomized rounding strategy to sparsify the resulting dense solution. 

Our main theoretical result guarantees an additive error approximation and provides a tradeoff between 
sparsity and accuracy. Our experimental evaluation indicates that our approach is competitive in practice, 
even compared to state-of-the-art toolboxes such as Spasm. 

Keywords. Sparce pea, rounding randomized algorithm. 

1 Introduction 

Large matriees are a eommon way of representing modem, massive datasets, sinee an m x n real-valued 
matrix X provides a natural strueture for eneoding information about m objeets, eaeh of whieh is deseribed 
by n features. Prineipal Components Analysis (PCA) and the Singular Value Deeomposition (SVD) are 
fundamental data analysis tools, expressing a data mattix in terms of a sequenee of orthogonal veetors of 
deereasing importanee. While these veetors enjoy strong optimality properties and are often interpreted 
as fundamental latent faetors that underlie the observed data, they are linear eombinations of up to all the 
data points and features. As a result, they are notoriously diffieult to interpret in terms of the underlying 
processes generating the data IIMD091I . 

The seminal work of HdGJOVll introduced the concept of Sparse PCA, where sparsity constraints are 
enforced on the singular vectors in order to improve interpretability. As noted in HdGJOVl [MD091IPDK13I . 
an example where sparsity implies interpretability is document analysis, where sparse principal components 
can be mapped to specific topics by inspecting the (few) keywords in their support. Formally, Sparse PCA 
can be defined a^(see eqn. (1) in UPDKlSIl i: 

Zont = max x^Ax, s.t. l|x||o < A:. (1) 

xeR**, ||x||2<i 

In the above formulation, the parameter k controls the sparsity of the resulting vector and is part of the input; 
A = X^X E I^nxn symmetric positive semidefinite (PSD) covariance matrix that represents all 
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pairwise feature similarities for the data matrix X and Xgpt denotes a vector that achieves the optimal value 
Zopt in the above formulation. In words, the above optimization problem seeks a sparse, unit norm vector 
Xopt that maximizes the data variance x^jAxop*. The following two facts are well-known: first, solving 
the above optimization problem is NP-hard IIMWA12I and, second, this hardness is due to the sparsity 
constraint. Indeed, if the sparsity constraint was removed, then the resulting optimization problem can be 
easily solved: the optimal vector is the top (left or right) singular vector of A and the related maximal value 
is equal to the top singular value of A. Finally, we note that more than one sparse singular vectors can be 
constructed using a simple deflation procedure; see Section ITT] for details. Following the lines of IIPDK13L 
we will only focus on the formulation of eqn. ffl. 

Related work. The simplest sparse PCA approaches are to either rotate llJol951l or threshold IICJ95II the top 
singular vector of the matrix A. Such simple methods are computationally efficient and tend to perform very 
well in practice (see Section[3]). However, there exist cases where they fail (see IICJ951I and Section[3]here). 
An alternative line of research focused on solving relaxations of eqn. O- For example, an relaxation of 
eqn. ([Til was first used in SCoTLASS IIJTU03II . Another possible relaxation is a regression-type approxi¬ 
mation IIZHT061 . which was implemented in IISCLE12II . (We will compare our approach to this method.) 
Finally, efficient optimization methods have been developed for the sparse PCA problem. For example, the 
generalized power method was proposed in IJNRSIOII : this method calculates stationary points for penalized 
versions of eqn. 

Despite the many approaches that were developed for sparse PCA, only a handful of them provide any 
type of theoretical guarantees regarding the quality of the obtained (approximate) solution. For example, 
the semidefinite relaxation of HdGJO?! was analyzed in IIAW08L albeit for the special case where A is a 
spiked covariance matrix with a sparse maximal singular vector. Briefly, IIAW08I studies conditions for the 
dimensions m and n of the data matrix X, and the sparsity parameter k, so that the semidefinite relaxation 
of lldGJ07ll recovers the sparsity pattern of the optimal solution of eqn. ([ill. Other attempts for provable 
results include the work of lldBG081 . which was later analyzed in lldBG14L In the latter paper, the authors 
show bounds for the semidefinite relaxation of ldBG08l . in the special case that the data points are sampled 
using Gaussian models with a single sparse leading singular vector. Strong compressed-sensing-type condi¬ 
tions were used in IIYZ13II to guarantee recovery of the optimal solution of eqn. ([U using a truncated power 
method. However, IIYZ13II requires that the optimal solution is approximately sparse and also that the noise 
matrix has sparse submatrices with small spectral norm. Finally, IIPDK13II describes a greedy combinatorial 
approach for sparse PCA and provides relative-error bounds for the resulting solution under the assumption 
that the covariance matrix A has a decaying spectrum. It is important to note that in all the above papers 
special assumptions are necessary regarding the input data in order to guarantee the theoretical bounds. Our 
work here does not make any assumptions on the input data, but we do pay the cost of increased sparsity 
in the final solution (see Theorem [U for details). There are also connections between sparse approximations 
and subspace learning methods, which are widely used in machine learning and data mining. Recently, 
methods that enforce sparsity have been developed for Human Activity Recognition llLZW^lht and Image 
Classification iLWT+16ll . Moreover, some of these methods have been applied to multidimensional data 
IIDXXM07I . 

Our algorithm. We present and analyze a simple, two-step algorithm to approximate the optimal solution 
of the problem of eqn. ([TJ. Our approach first finds a stationary point of an £i-penalized version of problem 
Then, a randomized rounding strategy is employed to sparsify the resulting dense solution of the ii- 
penalized problem. More precisely, we first solve: 

Zopt = max x^Ax, s.t. ||x||i < s/k- (2) 

xgR", ||x|| 2 <l 
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ALGORITHM 1: Vector sparsification 

Input: X G K", integer s. 

Output: X G M", E ( |x Ig) < s. 

for i = 1,..., n do 

Set: 

X* = 1 

Axj, with probability = min 1 , 1 1 

0, otherwise 

end 



Notice that we replaced the constraint on the norm of the vector x by a (tighter) constraint on the i\ norm 
of X. It is important to mention that problem ([21) is difficult and all we can hope in practice is to calculate 
a stationary point. However, one should not discount the quality of stationary points. In Section [3] we 
show that by calculating stationary points we capture as much of the variance as computationally expensive 
convex relaxations. Having said that, the theoretical analysis which follows assumes that we work with the 
globally optimal solution of problem Q. 

Let Xopt be a vector that achieves the optimal value Zopt for problem Q. Clearly, Xopf is not necessarily 
sparse. Therefore, we employ a randomized rounding strategy to sparsify it by keeping larger entries of Xop* 
with higher probability. Specifically, we employ Algorithm [Jon the vector Xopt to get a sparse vector Xop* 
that is our approximate solution to the sparse PCA formulation of eqn. ffl. 

It is obvious that, in expectation, the vector Xop* has at most s non-zero entries. We will discuss the 
appropriate choice for s in Theorem [Ubelow. 

Our results: theory. Surprisingly, this simple randomized rounding approach has not been analyzed in 
prior work on Sparse PCA. Theorem [T] is our main theoretical result and guarantees an additive error ap¬ 
proximation to the NP-hard problem of eqn. ([Til. For simplicity of presentation, we will assume that the 
rows (and therefore columns) of the matrix A have at most unit norm^ 


Theorem 1 Let Xopt be the optimal solution of the Sparse PCA problem of eqn. (|7|) satisfying ||xopt ||2 = 1 
and ||xopt||Q < k. Let icopt be the vector returned when Algorithm\I\is applied on the optimal solution Xopt 
of the optimization problem of eqn. (|2]), with s = 200A:/e^, where e G (0, 1] is an accuracy parameter. Then, 
Xopt has the following properties: 

1 . E 11 X(5p^ 11Q ^ s. 

2. With probability at least 3/4, 

— f “h 0.156. 


3. With probability at least 3/4, 


^opA^opt > yilpt^^opt - e 


(3) 


In words, the above theorem states that our sparse vector -kopt is almost as good as the optimal vector Xopt 
in terms of capturing (with constant probability) almost as much of the spectrum of A as Xopj does. This 
comes at a penalty: the sparsity of Xopt, which is equal to k, has to be relaxed to O [k/ e^) . This provides an 
elegant trade-off between sparsity and accuracjH. It is worth emphasizing that one should not worry about 

^We can relax this assumption by increasing s - our sampling factor - by a factor that depends on the upper bound of the row 
and column norms of A. 

less important artifact of our approach is the fact that the Euclidean norm of the vector Xopt is slightly larger than one. 
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the small success probability of our approach: by repeating the rounding t times and keeping the vector 
Xopi that satisfies the second bound and maximizes Axopt, we can immediately guarantee that we will 
achieve both bounds with probability at least 1 — 2“*. 

Our results:experiments. We empirically evaluate our approach on real and synthetic data. We chose to 
compare our algorithm with the state-of-the-art Spasm toolbox of IISCLE121IZHT06I . We also compare 
our solution with the simple MaxComp heuristic, which computes the top singular vector of matrix A and 
returns a sparse vector by greedily keeping the top k largest components (in absolute value) and setting the 
remaining ones to zero. Our empirical evaluation indicates that our simple, provably accurate approach is 
competitive in practice. 

2 Proof of Theorem 1.1 

2.1 Preliminaries 

Consider the indicator random variables for alH = 1... n which take the following values: 

^ _ i j- ) with probability pi 
* \ 0 , otherwise 

It is easy to see that x* = (5jXj for alH = 1. .. n. The following trivial properties hold for all i and will 
be used repeatedly in the proofs: E(5j = 1,E(1 — 5,) = 0, and E(l — = p~^ — 1 . For simplicity of 

notation we will drop the the subscript opt from ^opt, ^opt, and Xopf in all proofs in this section. 

2.2 A bound for 11 Xopt 11 q 

Recall that we will use the sampling probabilities pi as defined in Algorifhm [T] By definition, pi < 
s |xj| / ||x||i, fherefore 

n n I I 

La. -I 

i=l i=l " "1 

which proves fhe firsf bound in Theorem [T] 

2.3 A bound for ||xopj ||2 

The following lemma immediately implies fhe second bound of Theorem [T] by selling s = 200/c/e^. 
Lemma 1 Given our notation, with probability at least 3/4, 


X, 


opt II2 


< 1 -h 2 
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Proof: It is more intuitive to provide a bound on the expeetation of ||x — x ||2 and then leverage the triangle 
inequality in order to bound ||x|| 2 - Using the indieator variables 5i and linearity of expeetation, we get 


i=l 

n 




We will now prove the following inequality, whieh will be quite useful in later proofs: 




--1 ± t <-. 
Pi J s 


Towards that end, we will split the set of indiees {1... n} in two subsets: the set eorresponding to 
indiees i sueh that pi = I and the set eorresponding to indiees i sueh that p* < 1. Note that for all 
i £ it must be the ease that 


We now proeeed as follows: 


IE||x-x ||2 = 




EU-i 

E 

.tii- ”• 


s ~ s 

For the last inequality we used the faet that ||x||^ < Vk. We now use Markov’s inequality to eonelude that, 
with probability at least 3/4, 

l|x-k||^<-. (6) 

s 

To eonelude the proof note that, from the triangle inequality. 


|x-x||2 > I ||k||2 - IIXII 2 I 
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and thus 


||x ||2 < ||x ||2 + ||x — x ||2 < 1 + ||x — x ||2 • 

Combining with eqn. after taking the square root of both sides, concludes the proof of the lemma. o 


2.4 Proving eqn. (|3l) 


The following lemma states that the solution of the convex relaxation of the Sparse PCA problem is at least 
as good as the solution of the Sparse PCA problem. 

Lemma 2 Given our notation, 'X-opt A a feasible solution of the relaxed Sparse PCA formulation of eqn. ([2]). 
Thus, Zopt = > xJp^Axopt- 

Proof: Recall that -x.opt is a unit norm vector whose zero norm is at most k. Then, if we let sgn(x) denote 
the vector of signs for x (with the additional convention that if Xj is equal to zero then the i-th entry of 
sgn(x) is also set to zero), we get 


X, 


opt 111 


sgn(xopj)^Xopt < ||sgn(xopj )||2 ||x, 


opt 


12 < Vk. 


The second inequality follows since sgn(xopt) has at most k non-zero entries. Thus, Xopj is feasible for the 
optimization problem of eqn. Q and the conclusion of the lemma follows immediately. o 

Getting a lower bound for x^Ax is the toughest part of Theorem [U Towards that end, the next lemma 
bounds the error |x^jAxopt — x^^Axopj| as a function of two other quantities which will be easier to 
bound. 


Lemma 3 Given our notation, 

I ' 7 ^ I I I 

l^opt-^^opt XopiAXopiI < 2 |Xqp^A (fi-opt Xopi)| A | (ftopt ^opt) ^ (Aopt ^opt) 


Proof: We start with 


k^Ai-k^Ai = x^ Ak-x^ Ak + k^ Ax-i^ Ax 


< 


fk - x)^ Ai 


+ |x^A (x - x)| . 


Next, 


(k-x)^A(k-k) = |k'^A(x-k)-x^A(x-x)| 

> ||x'^A(k-x)| - |k^A(k-x)|| , 

which implies 

|x^A(x —x)| < |x'^A(x —x)|+ (x —x)^A(x —x) 

We now combine eqns. Q and (HJl to get 


(V) 

( 8 ) 

(9) 


li^Ak - k^Akl < 


(k-k)^Ai + |i'^A(k-x)| + (i - i)'^ A (i - x) 
= 2|x^A(x —x)|+ (x — x)^ A (x — x) 


Our next lemma will provide a bound for the first of the two quantities of interest in Lemma |3] 
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Lemma 4 Given our notation, with probability at least 7/8, 


l^optA (Xopt - Xopj)| < ^/8k/s. 

Proof: Let D G be a diagonal matrix with entries Ha = for alH = 1... n. Hence, we can write 

X = Dx. We have that 


(x — x)^ Ax = x"^ {In — D) Ax = x"^ 


(1 - (5i)Ai*x 

(1 - (52)A2*X 


(1 


n 

^ (1 - 6i) XjAj^x, 

i=l 


where Aj* is the i-th row of the matrix A as a row vector. Squaring the above expression, we get 


((x - x)^Ax)^ ^ ^ (xjA^x) (xj Aj*x) (1 - (5i) (1 - 6j) . (10) 

i=i j=i 

Recall that E (1 — 5j) = 0 for all z; thus, for all i j, 1 — di and 1 — 6j are independent random variables 
and therefore the expectation of their product is equal to zero. Thus, we can simplify the above expression 
as follows: 


E((i- 



n 

= E (1 - 6if {■XjAj^x)'^ 

i=l 

= S (b ■ 0 



In the last inequality we used |Aj*x| < ||Aj *||2 ||x ||2 < 1- The last term in the above derivation can be 
bounded as shown in eqn. ([5]), and thus we conclude 


E 


X — X 


Ax^ < k/s. 


Markov’s inequality now implies that with probability at least 7/8 


X — X 


Ax) < 8k/s. 


o 

Our next lemma will provide a bound for the second of the two quantities of interest in Lemma |3] The 
proof of the lemma is tedious and a number of cases need to be considered. 

Lemma 5 Given our notation, with probability at least 7/8, 


{±opt - ^optf A (Xopt - Xopt) < ^24A:^/s^ + Qk^ / + M\/k/s^ 


1/2 
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Proof: Using the indicator variables 6i and linearity of expectation, we get 


2 


E 



= E| ^ I = 


n 

X E ((1 — (5ij) (1 — di^) (1 — (1 — Sj^)). 

nji,*2 j'2=i 


( 11 ) 


Recall that E (1 — 5j) = 0 for all i. Notice that if any of the four indices ii, ^ 2 , ji, J 2 appears only once, then 
the expectation E ((1 — (5jJ (1 — di^) (1 — (1 — Sj^)) corresponding to those indices equals zero. This 

expectation is non-zero if the four indices are paired in couples or if all four are equal. That is, non-zero 
expectation happens if 


(A) : ^1=^2/ ji = J2 

(B) : ii = ji j^i2= J2 

(C) : h = j2 ^12= ji 

(D) : ii = i2= ji = j2 


(n^ — n terms) 
(n^ — n terms) 
(n^ — n terms) 
(n terms). 


For case (A), let ii = i 2 = k and let ji = j 2 = I, in which case the corresponding terms in eqn. CB 
become (notice that <5^ and bi are independent random variables since k i)\ 


n 

^ ((1 - (1 - <5,)') = 

k,i=i, k^e 



( 12 ) 


In the first inequality we used | A^^l < 1 for all k and I and added extra positive terms (corresponding to 
k = t), which reinforce the inequality. The last inequality follows from eqn. (I5]l- 

For case (B), let ii = ji = k and let i 2 = j 2 = f in which case the corresponding terms in eqn. CB 
become (notice that 5k and 5i are independent random variables since k i): 


n 

AfcfcA«x|x|E ^(1 — 5k)^ (1 — = 

^ AkkAu^lxj (— -l) (—-l) < 

k,e^k^i VPfc J \Pi J 
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( 13 ) 



In the first inequality we used Akk < 1 for all k and the fact that the diagonal entries of a symmetric positive 
definite matrix are non-negative, which allows us to add extra positive terms (corresponding to k = t) to 
reinforce the inequality. The remainder of the derivation is identical to case (A). 

For case (C), let A = j 2 = k and let i 2 = ji = I, in which case the corresponding terms in eqn. CD 
become (notice that bk and bi are independent random variables since k ^ i): 

n 

(^(1 - bkf (1 - bif^ = 

k,l=l, 



In the first equality we used the fact that A is symmetric; the remainder of the derivation is identical to 
case (A). 

Finally, for case (D), let fi = i 2 = ji = 32 = k, in which case the corresponding terms in eqn. (fTTl) 
become: 


k=l 


< 


< 



4 1 

-2 + “3 “ ^ 

Pk Pk 


^k 


Pk pI ' pI 


E 

k=l 

ta<, yn fi pI 



In the above derivation, we used | A^fcl < 1. We also split the set of indices {1... n} in two subsets: the 
set corresponding to indices k such that pk = I and the set corresponding to indices k such that 
Pk < 1. Note that for all k E it must be the case that pk = s |xj| / ||x||^. Thus, 


y]ALi|E(i- 4 )" < 

k=l 

< 


kei<y 


( 6||x||i ||x||i \ 

\s|xfc| 



k=l 


( 6||x||i l|x||i \ 

\s|xfc| s 3 |xfc|V 



k=l 





k=l 



Recall that ||x||g < ||x||p for any 1 < p < <? < 00 and use ||x II2 — 1 and ||x||^ < Vk to get 
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n 


1 


k=l 


6 ||x|| 



s 


+ 


X 


4 

1 


6^/k k‘^ 

— -'- 3 • 

S S-^ 

Combining all four cases (i.e., eqns. (fT^ . ([T3] ). (12.41 ). and (fT4]) l. we get 


E 


^(x — x)"^ A (x — x)^ < 2>k ^/+ Qy/k/s. 


( 14 ) 


Using Markov’s inequality and taking square roots concludes the proof of the lemma. o 

We can now complete the proof of the lower bound for x^^Axopj. First, combine Lemmas |3l|4l and|5] 
to get 


|xoptAxopt - XoptAxoptI < 2^/Sk/s + {2Ak^/s^ + Qk^/+ My/k/s^ ^ . 

Since each of Lemmas |4] and |5] fail with probability at most 1/8, it follows from the union bound that 
the above inequality holds with probability at least 1 — 2(1/8) = 3/4. Therefore, setting s = 200fc/e^ 
guarantees (after some algebra) that with probability at least 3/4, 

l^opA^opt - iopA^opt \ < e. 

Using the triangle inequality and Lemma|2]we get that with probability at least 3/4 

^opt-^^opt ^ ^opt-^^opt ~ ^ ^opt-^^opt ~ 

which concludes the proof of eqn. dS]). 


3 Experiments 

We perform experiments on both real and synthetic datasets. We chose to compare our algorithm with 
the solution returned by the state-of-the-art Spasm toolbox of US CUE 121 . which implements the approach 
proposed in IIZHT061 . We also compare our solution with the simple MaxComp heuristic BCJ95I : this 
method computes the top singular vector of matrix A and returns a sparse vector by keeping the top k 
largest components (in absolute value) and setting the remaining ones to zero. 

3.1 Experimental setup 

Let X G with m denote the object-feature data matrix, where every column has zero mean, and 

recall that A = X^X in eqn. (O. We use the following function to measure the quality of an approximate 
solution X G to the sparse PCA problem: 

/(x) = x'^Ax/IIAII 2 . (15) 

Notice that 0 < /(x) < 1 for all x which satisfy ||x ||2 < 1. The closer /(x) is to one the more the vector x 
is parallel to the top singular vector of A, or, equivalently, the closer /(x) is to one the more x captures the 
variance of the matrix A that corresponds to its top singular value. Our goal is to calculate sparse vectors x 
with /(x) PS 1. 
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ALGORITHM 2: Computing k sparse principal components 
Input: X G integer k. 

Output: U = V = {vi,...,Vfc}. 

Y = vi = rspca(X) and Ui = rspca(Y) 

for i = 2,.. ., fc do 

X = X - Xv,_iv7_i and Y = Y - Yui_iU,Li 
Vi = rspca(X) and = rspca(Y) 


Our approach first finds a stationary point of the optimization problem of eqn. (|2ll and then uses Algo¬ 
rithm [T] (with s = k) to obtain a sparse solution vector Xopf G M”. We note that the chosen value of s is 
much smaller than the O {k/e^) value stipulated by Theorem [T] Also, in practice, our choice of s works 
very well and results in solutions that are comparable or better than competing methods in our data. 

We note that in our use of Spasm, we used soft thresholding by setting the STOP parameter to — ||xopj ||q 
and 6 = — oo (both STOP and 6 are parameters of Spasm toolbox). This implies that the solutions obtained 
by Spasm and our approach will have the same number of non-zeros, thus making the comparison fair. 
Similarly, for MaxComp, after computing the top singular vector of A, we select the ||xopt||Q largest (in 
absolute value) coordinates to form the sparse solution. A final technical note is that the solutions obtained 
using either our method or MaxComp may result in vectors with non-unit Euclidean norms. In order to 
achieve a fair comparison in terms of eqn. ([T5] ). there are two options. The first one {naive approach) is to 
simply normalize the resulting vectors. However, a better approach (SVD-based) is possible: given a sparse 
solution vector x^pf, we could keep the rows and columns of A that correspond to the non-zero entries in 
Xopt and compute the top singular vector of the induced matrix. Note that the induced matrix would be a 
||xopf llo X ll^opf llo matrix and its top singular vector would be a ||xopf ||Q-dimensional vector. Obviously, 
this latter vector would be a unit norm vector and it could be padded with zeros to derive a unit norm vector 
in M” with the same sparsity pattern as ^opt- It is straight-forward to argue that this vector would improve 
the value of / compared to the naive normalized vector ^opt/ ||xopt|| 2 - In our experimental evaluation, we 
will evaluate both the naive and the SVD-based normalization methods. 

We also compare the methods based on their wall-clock running times. All experiments were run on a 
Intel(R) Core(TM) i7-6700 machine running at 3.40GHz, with 16 GB of RAM. 

In some of our experiments we will need to extract more than one sparse singular vectors. Towards that 
end, we can use a deflation approach (Algorithm ^ to construct more than one sparse singular vectors by 
first projecting the residual matrix into the space that is perpendicular to the top sparse singular vector and 
then computing the top sparse singular vector of the residual matrix. In Algorithm [H rspca refers to the 
solution of the optimization problem of eqn. (O followed by Algorithm [I]). 

3.2 Data Sets 

We used 22 matrices emerging from population genetics, namely the 22 matrices (one for each chro¬ 
mosome) that encode all autosomal genotypes that are available in both the Human Genome Diversity 
Panel IICon071l and the HAPMAP llLAT~*~08l project. Each of these matrices has approximately 2,500 sam¬ 
ples (objects) described with respect to tens of thousands of features (Single Nucleotide Polymorphisms or 
SNPs); see IIPEJDIOI for details. We also used a gene expression dataset (GSE10072, lung cancer gene 
expression data) from the NCBl Gene Expression Omnibus database. This dataset includes 107 samples (58 
lung tumor cases and 49 normal lung controls) measured using 22,215 probes (features) from the GPE96 
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platform annotation table. Both the population genetics and the gene expression datasets are interesting 
in the context of sparse PCA beyond numerical evaluations, since the sparse components can be directly 
interpreted to identify small sets of SNPs or genes that capture the data variance. 

Our synthetic dataset has been carefully designed in order to highlight a setting where the MaxComp 
heuristic will fail. More specifically, the absolute values of the entries of the largest singular vector of a 
matrix in this family of matrices is not a good indicator of the importance of the respective entry in a sparse 
PCA solution. Instead, the vector that emerges from the optimization problem of eqn. (|2ll is a much better 
indicator. For a detailed description of the synthetic data generator see Section IATT] in the Appendix. 

Our third data set comes from the field of text categorization and information retrieval. In such appli¬ 
cations, documents are often represented as a “bag of words” and a vector space model is used. We use a 
subset of the Classic-30 document collection, which we will call Classic-2. This subset consists of the CISI 
(Comites Interministeriels pour la Societe de ITnformation) collection (1,460 information science abstracts) 
and the CRANFIELD collection (1,398 aeronautical systems abshacts). We created a document-by-term 
mahix using the Text-to-Matrix Generator (TMG) IIZG06I the final mahix is a sparse 2,858 x 12,427 ma- 
hix with entries between zero and one, representing the weight of each term in the corresponding document. 


3.3 Results 

First we compare the performance of the different methods on a synthetic dataset, using the data generator 
which was described in Section [T2] with m = 2^, n = 2^^, and 9 ps 0.277r. In Figure [Ta] we plot (in 
the y-axis) the value of the performance ratio /(x) (as defined in eqn. (ITSl) ) for our method (rspca), the 
Spasm software, and the MaxComp heuristic. We also note that for each of the three methods, we use two 
different approaches to normalize the resulting sparse vector: the naive and the SVD-based ones (see the 
last paragraph in Section iTTI) . As a result, we have a total of six possible methods to create and normalize a 
sparse vector for sparse PCA. The x-axis shows the sparsity ratio of the resulting vector, namely ||xopi Hq 
W e remark that all six methods produce sparse vectors with exactly the same sparsity in order to have a fair 
comparison. Notice that in Figure [lil the MaxComp heuristic has worse performance when compared to 
either our approach or to Spasm: this is expected, since we conshucted this family of matrices in order to 
precisely guarantee that the largest components of the top singular vector would not be good elements to 
retain in the conshuction of a sparse vector. To further visualize this, we look at the sparse vectors, returned 
by the different methods, in Figure|4]in the Appendix. In this figure, we present the resulting sparse vectors 
for each of the three methods (normalization, obviously, is not relevant in Figure SJl for a particular choice 
of the sparsity ratio ( ||xopt||Q /n « 0.1). Notice that MaxComp fails to capture the right sparsity pattern, 
whereas our method and Spasm succeed. 

In the second experiment, we present the performance of the different methods on the real datasets 
described in Section |T2] The results are shown in Figures [T3[Tel and[Tdl (We only show results for the first 
two chromosomes of the joint HapMap and HGDP datasets; the other 20 chromosomes behave very similarly 
and are shown in Figures [5l [6l |7j and [8] in the Appendix). We note that in the population genetics data, 
our method has approximately the same or better performance compared to both MaxComp and Spasm. 
Not surprisingly, the naive normalization approach is consistently worse than the SVD-based one. It is 
worth noting that our SVD-based normalization approach easily improves the output of Spasm. This is 
because Spasm does detect the correct sparsity pattern but fails to compute the appropriate coefficients of 
the resulting sparse vectors. 

''ftp://ftp.cs.cornell.edu/pub/smart/ 

^The matrix was created using the stoplist provided by TMG, the tf-idf weighting scheme, and document normalization. We 
removed numbers and alphanumerics and we didn’t use stemming. 
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Il^opf ||o/« 

(a) Synthetic data: n = 2^ 




rspca (SVD-based) 
Spasm (SVD-based) 
^ MaxComp (SVD-based) 
rspca (naive) 

^ -Spasm (naive) 
^-MaxComp (naive) 


(b) HapMap+HGDP data (chromosome 1): n = 37493. 



Ipopi h/n 

(c) HapMap+HGDP data (chromosome 2): n = 40844. (d) Gene Expression: n = 22, 215. 

Figure 1: /(x) vs. sparsity ratio ||xopt||g In for various real and synthetic datasets. 


In Figure |2] (and Figures |9l [TOl [H] at the Appendix) we plot (in the y-axis) the running time for our 
method (rspca), the Spasm software, and the MaxComp heuristic. We also note that for each of the 
methods we use the naive approach (see the last paragraph in Section [TT]) . The x-axis shows the sparsity 
ratio of the resulting vector. Our algorithm presents a variable time performance that tends to decrease as the 
sparsity ratio increases. This is an artifact of the projected gradient ascent algorithm IIJNRSIOI that we used 
to find stationary points for the problem of eqn. (jj)). The smaller k is in (Illl, the harder the problem of (O 
becomes for projected gradient ascent. In practice we observed that the smaller k is the more iterations were 
needed for projected gradient ascent to converge since the algorithm balances between the standard PCA 
objective and the sparsity constraint. Projected gradient ascent, which solves problem ((H) tends to be slower 
for small k than the algorithm implemented in Spasm for solving the problem in 1ZHT061 . We anticipate 
that this issue can be solved by using more sophisticated optimization algorithms, we leave this for future 
work. Notice that MaxComp is the fastest method, but based on Figure [Tal there are cases where it fails to 
find a good solufion. The running time of MaxComp and Spasm appears fo be consfanf. This is because 
bofh mefhods require calculafion of fhe firsf principal componenf which dominafes fhe compufafional cosf. 
In particular, MaxComp is compufing fhe firsf principal componenf which is fhen sparsified, while Spasm 
is compufing fhe firsf principal componenf fo inifialize fhe algorifhm fhaf solves fhe problem in IIZHT061 . 
All algorifhms fake info accounf fhe sparsify of fhe dafa. 
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(a) HapMap+HGDP data (chromosome 1): n = 37493. 



0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 

\\xopt \\o/n 

(c) HapMap+HGDP data (chromosome 3): n = 34258. 



0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 

\\xopt \\o/n 

(h) HapMap+HGDP data (chromosome 2): n = 40844. 



0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 

II0/17 


(d) HapMap+HGDP data (chromosome 4): n = 30328. 


Figure 2: Running time vs. sparsity ratio ||xopt||Q /n for real data. 


Finally, we evaluate the performance of our algorithm in a text mining application, using the Classic-2 
document collection. First, we run Algorithm |2] to obtain two sparse singular vectors and we use the number 
of their non-zero entries to compute two sparse singular vectors for MaxComp and Spa srrB This way we 
can guarantee the same sparsity levels for all three pairs of singular vectors. We repeat this procedure for 
8 different values of k (sparsity parameter in eqn. ([Hi). Table [U summarizes, for each k, the variance and 
the sparsity (in parenthesis) captured by the top two principal components using PCA, randomized sPCA 
(rspca), MaxComp heuristic and Spasm or solving eqn. dH (cvx). 

^Spasm does not support sparse input matrices. 
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Table 1: Variance and sparsity captured by the principal components. PCA results in dense principal com¬ 
ponents, while Spasm and MaxComp share the same sparsity with rspca. 



k 

pea 

cvx 

rspca 

MaxComp 

Spasm 

Top Principal Comp. 

100 

0.4351 

0.3077 (99%) 

0.2942 (99%) 

0.1955 

0.2768 

Top two Principal Comp. 

0.6802 

0.4897 (99%) 

0.4680 (99%) 

0.3391 

0.4227 

Top Principal Comp. 

500 

0.4351 

0.3880 (95%) 

0.3728 (98%) 

0.3353 

0.3601 

Top two Principal Comp. 

0.6802 

0.6073 (95%) 

0.5864 (98%) 

0.5399 

0.5701 

Top Principal Comp. 

1000 

0.4351 

0.4136 (90%) 

0.4005 (95%) 

0.3825 

0.3912 

Top two Principal Comp. 

0.6802 

0.6486 (90%) 

0.6294 (95%) 

0.6074 

0.6163 

Top Principal Comp. 

1500 

0.4351 

0.4242 (84%) 

0.4120 (93%) 

0.4013 

0.4039 

Top two Principal Comp. 

0.6802 

0.6649 (82%) 

0.6470 (93%) 

0.6342 

0.6361 

Top Principal Comp. 

2000 

0.4351 

0.4295 (75%) 

0.4190 (91%) 

0.4133 

0.4131 

Top two Principal Comp. 

0.6802 

0.6730 (70%) 

0.6572 (91%) 

0.6503 

0.6491 

Top Principal Comp. 

4000 

0.4351 

0.4350 (6%) 

0.4278 (81%) 

0.4284 

0.4271 

Top two Principal Comp. 

0.6802 

0.6801 (3%) 

0.6700(81%) 

0.6710 

0.6690 

Top Principal Comp. 

8000 

0.4351 

0.4351 (0%) 

0.4324 (68%) 

0.4326 

0.4316 

Top two Principal Comp. 

0.6802 

0.6802 (0%) 

0.6764 (69%) 

0.6768 

0.6752 

Top Principal Comp. 

10500 

0.4351 

0.4351 (0%) 

0.4332 (63%) 

0.4333 

0.4324 

Top two Principal Comp. 

0.6802 

0.6802 (0%) 

0.6776 (64%) 

0.6778 

0.6764 


It seems that Spasm and MaxComp capture less variance than the randomized sPCA. Furthermore, the 
variance captured by randomized sPCA is constantly close to the one captured by the solution of eqn. ©, 
but with a sparser component as Table[T]indicates. 

Table |2] summarizes the terms with non-zero weights in randomized sPCA principal components with 
sparsity parameter k = 100. The terms are ranked in descending order with respect to their weights. Notice 
that the first principal component reveals terms that appear more often in the CRANFIELD collection while 
the second principal component reveals terms that appear mostly in the CISI collection. CRANFIELD’s 
terms are more singular than these of CISFs and they tend to dominate the singular vectors since they tend 
to appear more in the documents associated with the CRANEIELD collection than in the entire CLASSIC- 
2 collection (e.g., the word “boundary” has one appearance in CISI and 459 in CRANEIELD). The exact 
opposite happens for terms in CISI: a significant amount of these terms appear with high weights in both col¬ 
lections (e.g., the word “information” has 664 appearances in CISI and 44 in CRANEIELD). This indicates 
that these terms will appear in singular vectors that do not separate the two collections. 
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Table 2: Non-zero terms of the randomized sPCA when k = 100 


1st Principal Component 

2nd Principal Component 

boundary 

information 

layer 

library 

heat 

retrieval 

flow 

systems 

transfer 

system 

laminar 

scientific 

plate 

science 

shock 

libraries 

turbulent 

research 

hypersonic 

layer 

pressure 

services 

temperature 

boundary 

wall 

indexing 

flat 

search 

surface 

literature 

mach 

heat 

friction 

university 

compressible 

documents 

velocity 

users 

reynolds 

classification 

stagnation 

technology 

skin 


number 


stream 


equations 


transition 


solution 


viscous 


layers 


separation 


solutions 


gradient 


injection 


dimensional 


incompressible 


fluid 


edge 


cylinder 


shear 


point 


interaction 


numbers 


body 


wave 


method 


leading 


region 


bodies 


supersonic 


gas 


measurements 


approximate 


local 


equation 


case 


constant 
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4 Open problems 


From a theoretical perspective, it would be interesting to explore whether other relaxations of the sparse 
PCA problem of eqn. ([Til, combined with randomized rounding procedures, could improve our error bounds 
in Theorem [U It would also be interesting to formally analyze the deflation algorithm that computes more 
than one sparse singular vectors in a randomized manner (Algorithm [2]l. Finally, from a complexity theory 
perspective, we are not aware of any inapproximability results for the sparse PCA problem; to the best of 
our knowledge, it is not known whether relative error approximations are possible (without any assumptions 
on the input matrix). 
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APPENDIX 


A.l Synthetic data generator 

In order to generate our synthetic dataset, we generated the following matrix: 

X = USVT + E^, 


where U G and V G are orthonormal. The matrix X G has m distinct singular values 

Ui in its diagonal and the matrix G is a noise matrix parameterized by cr > 0. 

We set U to be a Hadamard matrix with normalized columns; we set S to have entries cti = 100 and 
Ui = 1/e* for alH = 2,..., m. The entries of the matrix follow a zero-mean normal distribution with 
standard deviation a = 10“^. We now describe how to construct the matrix V: we set V = Gd{0)Y, where 
V is also a Hadamard matrix with normalized columns. Here G„(0) is a composition of Givens rotations. 
In particular, Gn{0) is a composition of n/4 Givens rotations with the same angle 9 for every rotation. More 
precisely, let G{i,j, 9) G be a Givens rotation matrix, which rotates the plane i-j by an angle 9\ 


1 ••• 0 ••• 0 ••• 0 




0 

0 


Cl 


C2 


-C2 • • • 0 

Cl • • • 0 


0 ••• 0 ••• 0 ••• 1 


where i,j G {1,..., n}, ci = cos 9 and C 2 = sin 9. We define the composition as follows: 


Gn{9) = G{h,ju9)G{i2,j2,9) 


) G{iki jki9\' • • , Gr(f^y'4, jnjA-i 


with 

Zfc = - + 2A:-1, jk = - + 2k for /c = 1,..., 

The matrix Gn{9) rotates (in a pairwise manner) the bottom n/2 components of the columns of V. Since 
the Hadamard matrix has entries equal to -i-l or -1 (up to normalization), we will pick a value of 9 that 
guarantees that, after rotation, n/4 components of the columns of V will be almost zero. Thus, the resulting 
matrices will have about a quarter of components set at a large value, a quarter of their components set 
at roughly zero, and the rest set at a moderate value. For example, let n = 2^^ and 9 k, 0.277r; then, 
the difference between the first column of matrix V and V is presented in Figure [3l where we plotted the 
(sorted) absolute values of the components of the first column of the matrices V and V. 


18 
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Qj 0.015 - 


^—without rotation 
' ■' ■' with rotation 


1500 2000 2500 

index 


3000 3500 4000 


Figure 3: The red dashed line corresponds to the sorted absolute values of the components of the first 
column of matrix V. Similarly, the blue line corresponds to the first column of V. 
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(a) Actual eigenvector 




(b) rspca 
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0.02 

0 
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(c) Spasm 
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(d) MaxComp 


Figure 4: Sparse PCA solutions for a synthetic experiment. 
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(a) HapMap+HGDP data (chromosome 3): n = 34258. 


(b) HapMap+HGDP data (chromosome 4): n = 30328. 




(c) HapMap+HGDP data (chromosome 5): n = 31479. 


(d) HapMap+HGDP data (chromosome 6): n = 32800. 



(e) HapMap+HGDP data (chromosome 7): n = 27130. 

Figure 5: Performance of sparse PC A algorithms on 


(f) HapMap+HGDP data (chromosome 8): n = 28508. 

additional population genetics data (chromosomes 3-8). 
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(a) HapMap+HGDP data (chromosome 9): n = 24070. 


(b) HapMap+HGDP data (chromosome 10): n = 26327. 




(c) HapMap+HGDP data (chromosome 11): n = 24440. (d) HapMap+HGDP data (chromosome 12): n = 24084. 




(e) HapMap+HGDP data (chromosome 13): n = 18958. (f) HapMap+HGDP data (chromosome 14): n = 16469. 


Figure 6: Performance of sparse PC A algorithms on additional population genetics data (chromosomes 
9-14). 
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(a) HapMap+HGDP data (chromosome 15): n = 15351. (b) HapMap+HGDP data (chromosome 16): n = 15289. 




(c) HapMap+HGDP data (chromosome 17): n = 12945. 


(d) HapMap+HGDP data (chromosome 18): n = 15373. 




(e) HapMap+HGDP data (chromosome 19): n = 8465. (f) HapMap+HGDP data (chromosome 20): n = 13015. 


Figure 7: Performance of sparse PCA algorithms on additional population genetics data (chromosomes 
15-20). 
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(a) HapMap+HGDP data (chromosome 21): n = 7556. 



(b) HapMap+HGDP data (chromosome 22): n = 7334. 


Figure 8: Performance of sparse PC A algorithms on additional population genetics data (chromosomes 
21 - 22 ). 
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(a) HapMap+HGDP data (chromosome 5): n = 
31479. 
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(c) HapMap+HGDP data (chromosome 7): n = 
27130. 
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(b) HapMap+HGDP data (chromosome 6): n = 
32800. 
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(d) HapMap+HGDP data (chromosome 8): n = 
28508. 
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(e) HapMap+HGDP data (chromosome 9): n = 
24070. 
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(f) HapMap+HGDP data (chromosome 10): n 
26327. 


Figure 9: Running time of sparse PC A algorithms on additional population genetics data (chromosomes 
5-10). 
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(a) HapMap+HGDP data (chromosome 11): n = 
24440. 
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(b) HapMap+HGDP data (chromosome 12): n = 
24084. 
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(c) HapMap+HGDP data (chromosome 13): n = 
18958. 
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(d) HapMap+HGDP data (chromosome 14): n = 
16469. 
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(e) HapMap+HGDP data (chromosome 15): n 
15351. 
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(f) HapMap+HGDP data (chromosome 16): n 
15289. 


Figure 10: Running time of sparse PCA algorithms on additional population genetics data (chromosomes 
11-16). 
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(a) HapMap+HGDP data (chromosome 17): n = 
12945. 
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(c) HapMap+HGDP data (chromosome 19): n = 
8465. 
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(b) HapMap+HGDP data (chromosome 18): n = 
15373. 
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(d) HapMap+HGDP data (chromosome 20): n = 
13015. 
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(e) HapMap+HGDP data (chromosome 21): n 
7556. 
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(f) HapMap+HGDP data (chromosome 22): n 
7334. 


Figure 11: Running time of sparse PCA algorithms on additional population genetics data (chromosomes 
17 - 22 ). 
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